HCS—hierarchical algorithm for simulation of omics datasets

Abstract Motivation Analysis of the omics data with the help of machine learning (ML) methods is limited by small sample sizes and a large number of variables. One possible approach to deal with such data is using algorithms for feature selection and reducing the dataset to include only those variables that are related to the studied phenomena. Existing simulators of the omics data were mostly developed with the goal of improving the methods for generations of high-quality data, that correspond with the highest possible fidelity to the real level of molecular markers in the biological materials. The current study aims to simulate the data on a higher level of generalization. Such datasets can then be used to perform tests of the feature selection and ML algorithms on systems that have structures mimicking those of real data, but where the ground truth may be implanted by design. They can also be used to generate contrast variables with the desired correlation structure for the feature selection. Results We proposed the algorithm for the reconstruction of the omic dataset that, with high fidelity, preserves the correlation structure of the original data with a reduced number of parameters. It is based on the hierarchical clustering of variables and uses principal components of the clusters. It reproduces well topological descriptors of the correlation structure. The correlation structure of the principal components of the clusters then is used to obtain datasets with correlation structures similar to the original data but not correlated with the original variables. Availability and implementation The code and data is available at: https://github.com/p100mma/hcrs_omics.


Introduction
Modern methods of molecular biology, collectively known as omics generate data on biological processes with very high resolution-such as, e.g.expression levels of individual genes in the individual cells.The datasets obtained with such methods help researchers elucidate the mechanisms of cancer or genetic diseases.Due to the very high cost of experimental studies, and legal and ethical concerns, the sample sizes arising in such experiments are most often small, whereas the number of variables can reach hundreds of thousands.Therefore, the analysis of such data often involves the use of specially designed protocols that utilize feature selection and machine learning (ML) algorithms.Validation of such protocols is crucial, especially when one deals with black-box ML algorithms.One of the possible approaches to validate them is to test them on datasets with properties similar to the experimental ones, but where the ground truth is known, hence the true performance can be measured.To this end, generated data should have statistical properties (known a priori) resembling the ones of reference datasets.Multiple tools for the simulation of read distribution in the single sample have been developed, in particular for the single cell transcriptomics, the detailed review of the field is given in the article by Sun and coworkers in an article introducing scDesign2 tool (Sun et al. 2021).They aim at the generation of datasets that can be used as a benchmark for analysis of the single cell data.To our knowledge, the only tool available for simulations of transcription for the bulk data is the WGCNA package (Zhang and Horvath 2005).However, its primary goal is the analysis of gene expression, and simulations are of secondary importance.The structure of simulated datasets doesn't reflect well the complex correlation structure of gene expression, see Figs 2-4.
The classical method for generating data correlated according to a known covariance matrix Σ uses Cholesky Decomposition (Press et al. 2007), which requires Σ to be positive definite (PD) (Higham 2009).When the number of samples is larger than the number of variables, the estimate of Σ is not PD.There are methods of dealing with this problem, such as "repairing" techniques that can find the nearest PD approximation (Higham 2002), but these methods have problems with the interpretability of the results.However, problems of interpretability as well as estimation in the presence of high-dimensionality and noise can be dealt with using unsupervised approaches.
Simulating marginal distributions of genes alone is much easier.Techniques such as inverse transform sampling (Press et al. 2007) can be used along with parametric distribution fitting, to generate new random variates which replicate reference one-dimensional distributions.Coupling probability integral transform with quantile function transformations (inverse transform) can be used to estimate both marginal distributions and latent covariance matrix, as was done in Sun et al. (2021).

Clustering-based approaches in biology
Clustering can be used either for purposes of grouping samples into sub-types (like deriving cell-types in single-cell sequencing Xu andSu 2015, Stuart et al. 2019) or for grouping genes themselves, into highly correlated clusters, for purposes of identifying gene "modules" (D'haeseleer 2005, Zhang andHorvath 2005).For example, in the context of genome-wide gene expression profiling, modules are groups of genes with similar expression profiles, often co-regulated and functionally related (Saelens et al. 2018).While some concerns about finding modules by clustering methods have been raised (Saelens et al. 2018), correlation-based clustering methods are still widely used to infer new knowledge from experimental data in biomedical applications-co-expressed genes typically are involved in the same processes (van Dam et al. 2012) and therefore-clusters of genes derived from data alone can be used to find relevant biomarkers (Tian et al. 2020, Feng et al. 2022).Gene Ontology Enrichment (Huang et al. 2009) is often another step in the analysis, and clustering that produces better enrichment scores tends to be preferred by researchers (Song et al. 2012).

Weighted graph clustering
A graph model of data, in which entities are nodes in a graph (network), is often preferred for the analysis of biological data.Grouping is then done based on the topological characteristics of the graph.A model of a weighted network also encodes the strength (and possibly the sign) of the relations between nodes.Numerous approaches to the problem exist, for details, we refer to Fortunato (2010) and Fortunato and Hric (2016).Notable examples include (Rosvall andBergstrom 2008, Van Dongen 2008), some have found use in the biological domain (Shih andParthasarathy 2012, Stuart et al. 2019).(Dis)similarity-based methods, such as classical bottom-up hierarchical clustering algorithm, can also be applied under the graph model (Fortunato 2010).
Two methods which we use in this research are the Markov Clustering Algorithm (MCL) (Van Dongen 2008) and WGCNA framework (Zhang andHorvath 2005, Langfelder andHorvath 2009).MCL finds clusters from an equilibrium state of an enhanced random walk, in which stronger probabilities are amplified (by a process called inflation).In WGCNA, one uses a transformed value of correlation coefficient to group variables (typically genes) into clusters interpreted as "modules."Apart from a robust clustering protocol, WGCNA offers methods of analysis especially suited for networks derived from correlation (Zhang and Horvath 2005).WGCNA also uses cluster-wise Principal Component Analysis (PCA) to derive an "Eigengene," a latent factor summarizing the characteristics of the correlated genes in the module (Langfelder and Horvath 2009).

PCA in the low sample, high variable setting
In classical PCA (Jolliffe 2002), one finds coordinates of the data in a different set of axes, where the first axes carry most of the overall variance.Each PC depends on each original variable in the dataset, which is not ideal for interpretation in gene expression data.This has led to the development of sparse PCA methods (Zou et al. 2006).WGCNA deals with the problem by performing PCA cluster-wise.
By using only k PCs in the decomposition, we can obtain an approximation to the original dataset, which is not independent of it (Halko et al. 2011).However, including some randomization in the process can lead to a basic simulation method of the data.A simple heuristic similar to the method just described exists in WGCNA R (Langfelder and Horvath 2009) package, which can be used for simulating data with correlation described by input clustering.The aim of this article is to develop an improved simulation procedure that also possesses the intuitiveness of the original WGCNA approach, but can produce much more realistic results.

Materials and methods
As in WGCNA methodology, we propose to utilize a correlation based clustering of variables and to perform PCA in each cluster separately.This is consistent with the intuition that a set of highly correlated variables can be accurately described by a structure of lower dimensionality.Such decomposition requires using more PCs than a classic approach, but it better reflects complicated correlation structure in the data.
To explain a fraction total variance in the dataset, we could explain a fraction of the total variance of each cluster.If the clustering we use is really fine-grained, then the required number of PCs to use in each cluster to reach the target variance can be satisfactorily small.However, a large number of clusters is still difficult to interpret, if we do not know at least if they are related in some manner.That's why we propose a hierarchical approach.
The general procedure for the simulation is described in Fig. 1.

Definitions and notation
We assume that dataset is described by set of N variables X , each X i determined by the set of indexes I : X ¼ fX i ji 2 Ig.We define clustering P ¼ fC 1 ; C 2 . ..g as a partition of indexes: . Hierarchical clustering is a sequence of (nested) clusterings, in which the sets of the P l þ 1 subdivide clusters of the previous clustering P l .V X Ci will mean sum of variances of variables X j from X with j 2 C i , i.e.V X Ci ¼ If we omit the subset of indexes in the notation, like V X , that means the sum runs over all elements of X .
Consider now set of PCs PC 1 ; PC 2 . . .PC M of X , which were calculated separately for each cluster in P. Note that in contrast to original PCA, Var ð since PCs of different clusters might be correlated.However, such property holds locally in each cluster:

Simulating omics datasets ii99
Nevertheless, one can reconstruct each variable X i by using only k PCs of the cluster to which it belongs.We will call such reconstruction a kth order reconstruction of X based on P.
In the HCR algorithm, we extend this construction to be compatible with hierarchical clustering.The algorithm uses several mathematical facts, which make it work.See the Supplementary Material for derivations.

Hierarchical clustering based reconstruction algorithm-HCR
We describe the HCR algorithm, for reconstructing data based on PCA combined with hierarchical clustering.The main input parameters are: limiting fraction of variance to explain: f E 2 ½0; 1�, k-maximum number of PCs to use per cluster.
The algorithm uses two methods for clustering variables based on their correlation: an initial "adaptive" clustering algorithm and a "separator" algorithm.These can be any methods that satisfy the following criteria.The initial "adaptive" algorithm should be able to detect clusters of various sizes, and preferably be able to perform initial prefiltering: it should classify some variables to a "noise" subset, denoted as C 0 (not real clusters), irrelevant to the overall correlation structure of the rest of the network (noiseless part, union of "proper" clusters, denoted C i6 ¼0 ).C 0 could be variables distant from every other variable or low variance ones.In contrast, a "separator" should produce a complete partition of the set of variables.
"Adaptive" algorithm produces the first partition P 1 .The "separator" algorithm is used to produce further subdivisions P 2 ; P 3 . . . of initial clusters, based on correlation of the "residuals." The general steps of the HCR algorithm are then: 1) Initial division of X by an adaptive clustering algorithm, producing partition , stop-there is not enough variance in the noiseless part of the dataset for given f E .Otherwise, calculate X 1 -kth order reconstruction of X based on P 1 .2) For each C j 2 P 1 with unexplained variance left, subdivide it based on residuals E 1 ¼ fX i − X 1 i g in C j ; after processing each C j , produce P 2 and X 2 , a kth order reconstruction of residuals

3) Continue subdivisions of clusters with unexplained vari-
ance until convergence of the reconstruction is satisfactory.
The exact technical description of a generalized version of the algorithm is moved to Supplementary Material.Here we provide a general overview and some practical remarks.The final reconstruction of each X i is set to X R i :¼ , where g is the number of levels of hierarchy used.Each X R i is a linear combination of at most g × k PCs of clusters to which i belongs.If k ¼ 5, g ¼ 2, P 1 contains 3 clusters, and P 2 11, assuming no cluster was explained in P 1 fully, this gives 70 PCs in total, 5 per (sub)cluster.However, each reconstructed variable is a linear combination of at most 10 (5 þ 5) generating PCs (which are independent).The set of all PCs and coefficients used to reproduce variables in HCR is further referred to as a hierarchical clustering-based decomposition (HCD) of the dataset X.In principle, it is always true for the reconstructed variable X R i that (see Supplementary Material for derivation) (2) so as an internal convergence criterion, we measure total variance explained as V X R =V X .The last step (continue until exhaustion) can be too excessive in practice.Our experiments on real data suggest that even after two to three levels of clustering, the remaining signal can be regarded as noise.Therefore, we recommend applying a fixed number of subdivision steps, then examining the results and eventually continuing further based on that.
Currently, for a "separator" algorithm, we use hierarchical clustering based on the complete linkage method (with initial distance between residuals i, j calculated as 1 The best split is chosen from several candidate ones by maximizing the normalized entropy of a division: ð − P ng i¼1 p i log p i Þ=ðlog n g Þ. n g is the number of clusters in the split, i ranges over different subclusters in the split, and p i is the proportion of the subcluster in the whole split.This quantity is the biggest for an even split.This combination deals with the empirically observed tendency of the clustering algorithms, which tend to split initial clusters into one big sub-cluster and periphery, gathered into much smaller sub-clusters. HCR is used to model a latent correlation structure of really high-dimensional datasets but is not a simulation.However, it can be used to obtain a simulation method with several desirable properties: 1) it realistically replicates correlations of the original dataset to the known degree of accuracy, 2) it learns the general structure of real data, and summarizes it with a smaller set of latent factors, 3) it produces variables with known parametric distributions, 4) it produces new, independent synthetic data samples with real-data-like N-variate distribution.

HCS simulation procedures
Each variable in HCR is a lin.combination of PCs: X R j ¼ In HCS method, to simulate variable X S j , we use same equation but replace original PCs with synthetic ones.Correlations between synthetic PCs are identical to correlations between original ones, but a synthetic set of PCs is independent of the original one.
We propose two simulation methods for producing variables with a similar correlation structure to the reference data.Both require HCD of real data as an input and model covariance of associated PCs.In the first one, each simulated variable follows a known normal distribution.We refer to it further as HCS(n).The second one, HCS(f), is an extension of the first one, where we also model marginal distributions of PCs."f" stands for "fitted" distributions of PCs.HCS(f) requires a choice of a parametric family of distributions to model the associated PCs.Since shapes of distributions of PCs can be difficult to predict, we propose to use a metalog family of distributions (Keelin 2016), which are flexible, exhibit a closed form quantile function, are fit by a linear procedure and allow for modelling multimodal distributions (Keelin 2016).
At the last step of HCS, an amount of random noise is added to each simulated variable, to make its variance equal ii100 Stomma and Rudnicki to its real counterpart.This utilizes the relation in Equation (2) and is an important factor for making the simulation realistic (see Figs 2 and 3).If HCR is to be used as a faithful replication of real data, this step should also be included.We refer to the Supplementary Material for computational details of procedures.

Implementation
To investigate the performance of the proposed algorithm, we have run a series of computational experiments on highdimensional gene expression data.As an initial test dataset, we used a dataset of RNA-seq gene expression profiles of breast cancer patients (BRCA) (Pereira et al. 2016).After initial prefiltering and preprocessing (as done in Polewko-Klim et al. 2021), the set consists of 1394 samples, described by 8673 gene expression profiles.This dataset contains arguably a larger number of samples than usually seen in typical applications.Therefore, we have run additional tests on a smaller (605 samples), well-known KIRC dataset from TCGA, containing highly correlated RNA-seq gene expression levels taken from kidney cancer study (Peng et al. 2015).After standard preprocessing (as in Polewko-Klim and Rudnicki 2020), for modelling, we used 15 166 genes with the highest variance.
All relevant code was written in R language and is available in the public repository (https://github.com/p100mma/hcrs_omics), along with mentioned preprocessed input.
In the first step, we compared our method with two reference approaches on BRCA data: SVD reconstruction and WGCNA simulation.For plain SVD-based reconstruction of real data without clustering, we have used 10 and 70 PCs.70 corresponds to the total number of PCs we use in our method, and 10 is the number of PCs used in our method to replicate each variable.We also compare against SVD reconstruction with added noise as in the last step of the HCS procedure.
We have also adopted an initial simulation heuristic implemented in R WGCNA package (Langfelder and Horvath 2009), which also uses clustering as an input.For the basis of this simulation method, we have used a version of the standard clustering protocol proposed by WGCNA (Zhang and Horvath 2005), utilizing in the end stage publicly available R Figure 2. Clustering coefficient (top row) and weighted degree distributions (bottom row) for different methods of reconstruction and simulation of the BRCA dataset (yellow/brighter histograms).The distribution of the clustering coefficient in the original dataset (blue/darker histograms) is shown for comparison.The labels: SVD10 and SVD70 denote a reconstruction of the dataset with 10 and 70 singular vectors, respectively; HCR-X-Y-Z, HCS(n)-X-Y-Z, and HCS(f)-X-Y-Z correspond to the reconstruction with the HCR algorithm, simulation with the HCS algorithm using the normal distribution of principal components, and simulation with the HCS algorithm using the fitted distribution of principal components, respectively, parameters X, Y, and Z are the number of principal components used in reconstruction, the number of clusters at the first level of the hierarchy, and the number of clusters at the second level of the hierarchy, respectively.The þN label denotes that the noise was added to the reconstruction or simulation.Note that different scales for the X-axis are used in the individual plots.

Simulating omics datasets ii101
package: dynamicTreeCut (Langfelder et al. 2016).Since the original simulation function from the WGCNA R package was not suited for real simulation studies, we benchmark against a modified version.For initial clustering in our method, we have chosen the MCL algorithm based on the square of Pearson correlation.We have used the original C implementation of the MCL algorithm made by its original author (Van Dongen 2008).Clusters significantly smaller than the rest were set as "noise."In further divisions using the complete linkage method, we chose the best split out of splits into a number of clusters ranging from 2 to 7. Lastly, for fitting distributions in simulation method HCS(f), we utilized R package rmetalog (Faber and Jung 2021).For additional tests on the KIRC dataset, we have used a similar approach, utilizing two layers of hierarchy of clusters, maximal number of PCs per cluster set at 5, yielding 93 PCs used in total and 10 PCs used per variable.Details of the modified WGCNA simulation and the parameters of the methods used are described in the Supplementary Material.

Results
Several metrics were used for the analysis of results.The basic ones for the reconstructions are the fraction of variance of the original data explained by the reconstructions, the similarity between the original and reconstructed variables, and the difference between the original correlation structure of the data and the reconstruction.In the case of simulations, the only metric that should be carried over from the reconstructions is the difference between the simulated and real correlation structure.In contrast with reconstructions, the correlations between simulated and original variables should be indiscernible from the noise.However, the general topological properties of the correlation structure of both reconstructed and simulated variables should be very close to that of the original dataset.
The numerical metrics comparing the overall structure of HCR/HCS data to the reference, for varying numbers of PCs and levels of hierarchical reconstruction, are shown in Table 1.The first column differentiates between reconstruction and simulation types.Five next columns describe the parameters of the reconstruction/simulation, the seventh column shows the fraction of variance explained by the model, and in the last three columns, three following measures of the distance between the model and reference are shown: maximal and average of the squared correlation computed between all variables from the real data and simulations/ reconstruction as well as RMSE computed between correlation matrices.
For brevity, for the BRCA dataset, the data for reconstructions without noise are shown only for models obtained with two principal components per cluster at each level.Trends observed here are also observed in models with a larger number of PCs.A full table showing parameters using more layers than 2 can be found in the Supplementary Material.As can be expected, both using multiple layers of the clustering hierarchy and using multiple PCs at each level improves the level of variance explained by the model, and decreases the RMSE between correlation matrices.Both actions also increase maximal correlations and decrease average correlations between original and reconstructed data.Too simple models don't reproduce the original variables well and also contain too many weak spurious correlations.As expected, adding noise to the system decreases the level of spurious correlations and, at the same time, significantly decreases RMSE between the original correlation matrix and one computed for reconstructed/simulated data, by removing the spurious correlations.For the KIRC dataset, we report at the bottom of the table the version with g ¼ 2; k ¼ 5, the one which is also shown on plots.We note that here, the clustering was able to capture the total variance better than in the BRCA dataset (0.55 versus 0.48), and the algorithm already explained the full variance of some of the clusters (which is evident, e.g. from that total of 93 PCs was sufficient instead of 95).This is also the reason why we observe lower differences between the "noisified" and "clean" versions.The simulations of new data, as we can see in rows beginning with "S(f)" or "S(n)," are close to the real data in terms of correlation matrix on the same level as reconstructions, while their overall correlation with real data is effectively zero.

Correlation networks in real and synthetic datasets
The comparison of network topology summary statistics between reference, reconstructed, and simulated data is shown in Figs 2 and 3. Weights in constructed networks were based on squares of correlation coefficients.Distributions of topology descriptors for our methods are really similar inside each of the noiseless/noisified groups of methods.Here we show some of them, see Supplementary Material for complete set.The top row on both figures displays the comparison of the distribution of the clustering coefficient (cc) (Zhang and Horvath 2005) in between the reference data (in blue) and variants of synthetic/reconstructed data.One can immediately see that noiseless reconstructions, neither with For column þ N:1 means added noise, 0 means no noise added.V E -a fraction of variance explained.R mx , R avg are, respectively, maximal and average square of correlation coefficients computed between every X i from the original dataset and X 0 j from the reconstructed/simulated version.CR-RMSE computed between a correlation matrix computed for the original and one computed for reconstructed/synthetic data. ii102 Stomma and Rudnicki straightforward SVD nor with our hierarchical approach, don't agree well with the reference.The ccs of datasets reconstructed from the PCs are significantly higher than those observed in the original data.On the other hand, adding noise to the system leads to distributions that are either very close or indistinguishable.Identical behaviour is also observed for the second metric describing the properties of the correlation networks, namely the weighted degree (Zhang and Horvath 2005), displayed in the bottom row of Figs 2 and 3. Topological descriptors of datasets obtained from noisified variants of reconstructions or HCS algorithm display behaviour identical to the reference data, while the distributions for the datasets obtained with the WGCNA simulations visibly diverge from the reference.This shows that using 1 PC of each cluster is not sufficient, as in the case of basic WGCNA simulation protocol.
The differences between methods can also be observed in the general appearance of the dendrograms arising in the hierarchical clustering of produced variables, see Fig. 4. Dendrograms for datasets obtained from reconstructions and HCS with added noise look very similar to reference, noiseless variants have visibly more rough structure, whereas simulation with WGCNA reflects the overall structure of data, but the resulting dendrogram is too smooth-lacking internal structure within individual branches.

Discussion
HCS produces faithful representations of the original datasets and allows for the generation of datasets that are not correlated with the original, yet reproduce its general structure.However, one should note, that the simulation method leads to replication of the covariance of the generating PCs, which is an inherently linear operation, hence we may miss nonlinear relations that are present in the original data.Importantly we show that reconstruction alone, without substituting leftover variance with independent noise, is not realistic in itself.An interesting question, that is worth investigating is what ratio of signal to noise is required to get a realistic correlation network topology or what is the suspected "true" noise percentage in the overall variance and how to determine it in a statistical manner.
Our results show that the intuitive WGCNA cluster-wise PCA method indeed can be extended for building more realistic models.It brings an interesting potential for including interpretability in the faithful data simulation process.HCR and HCS, in general, can be used with any clustering method, implying that the procedures could be used not only for the validation of supervised approaches but also for assessing the degree to which a clustering algorithm can effectively capture the hierarchy of correlation structure well-one could argue that the fewer layers and clusters per fixed constraint on PCs one needs, the more these clusterings are descriptive of the correlation as a whole.

Figure 1 .
Figure 1.Schematic representation of HCR algorithm (four boxes on top) and HCS simulations (two bottom boxes).

Figure 3 .
Figure 3. Weighted degree and clustering coefficient distributions for different methods of reconstruction and simulation of the KIRC dataset.The distribution for the original dataset is shown for comparison.The layout, labels, and colours are as in Fig. 2.

Figure 4 .
Figure 4.The dendrograms of the hierarchical clustering for the original data and various reconstruction and simulation schemes, for both datasets.The top row shows results on BRCA data, and the bottom one-for KIRC.REF is "reference"; other labels are as in Fig. 2. Hierarchical clustering displayed is based on topological overlap measure (Zhang and Horvath 2005) derived from squares of correlations.

Table 1 .
Reconstruction accuracy for varying parameters of the model.PCs per cluster, number of levels in hierarchical clustering.n C ; n PC -total number of clusters, total number of PCs.